
[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461

Open
vinay553 wants to merge 2 commits into master from vinayparakala/expose-phash-on-dataset-item

vinay553 commented May 18, 2026

Summary

Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.
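As a rough illustration of the near-duplicate use case, the sketch below compares pHash strings by Hamming distance. It assumes the 64-character "0"/"1" string format described in this PR. The helper names (`hamming_distance`, `near_duplicates`) and the `max_distance=4` threshold are hypothetical and not part of the SDK.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length '0'/'1' strings differ."""
    if len(a) != len(b):
        raise ValueError("pHash strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_duplicates(
    hashes: Dict[str, Optional[str]], max_distance: int = 4
) -> List[Tuple[str, str, int]]:
    """Return (ref_a, ref_b, distance) for every pair within max_distance.

    `hashes` maps a reference ID to its pHash. Items whose pHash is None
    (not yet backfilled) are skipped. The threshold is an assumed value
    and should be tuned per dataset.
    """
    known = sorted((ref, h) for ref, h in hashes.items() if h is not None)
    pairs = []
    # Pairwise comparison is O(n^2); fine for a sketch, slow for large datasets.
    for (ref_a, hash_a), (ref_b, hash_b) in combinations(known, 2):
        distance = hamming_distance(hash_a, hash_b)
        if distance <= max_distance:
            pairs.append((ref_a, ref_b, distance))
    return pairs
```

With this change, the input mapping could be built from any SDK call that yields items, for example `{item.reference_id: item.phash for item in dataset.items}`.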

  • Adds a new phash: Optional[str] field to the DatasetItem dataclass: a 64-character string of "0" and "1" characters when the backend has backfilled it, None otherwise.
  • Threads phash=payload.get(PHASH_KEY) into DatasetItem.from_json. Because every SDK method that returns a DatasetItem goes through from_json, this single change exposes item.phash on:
    • items_and_annotation_generator
    • items_generator / dataset.items
    • query_items
    • iloc / refloc / loc
  • Bumps version 0.18.2 → 0.18.3 and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
  • Adds a top-level CLAUDE.md capturing release workflow, branch/PR conventions, and the from_json-centralization insight for future agent sessions.
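
The from_json centralization described above can be sketched with a reduced stand-in class. This is not the actual `DatasetItem` code. `ItemSketch` and its three fields are hypothetical, and the key strings are taken from the example payload in the test plan below.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Key strings copied from the example payload in the test plan.
REFERENCE_ID_KEY = "reference_id"
IMAGE_URL_KEY = "image_url"
PHASH_KEY = "phash"


@dataclass
class ItemSketch:
    """Hypothetical stand-in for DatasetItem, reduced to three fields."""

    reference_id: str
    image_location: Optional[str] = None
    # Populated only when the backend includes "phash" in the payload.
    phash: Optional[str] = None

    @classmethod
    def from_json(cls, payload: Dict[str, Any]) -> "ItemSketch":
        return cls(
            reference_id=payload[REFERENCE_ID_KEY],
            image_location=payload.get(IMAGE_URL_KEY),
            # dict.get returns None when the key is absent, so payloads
            # from before the backfill deserialize without error.
            phash=payload.get(PHASH_KEY),
        )
```

Because every caller goes through the one classmethod, adding the `payload.get` line there is enough. None of the generator or query methods need to know about the new field.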

Test plan

  • Local install: poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)" prints the hash.
  • DatasetItem.from_json falls back to None when the backend omits phash (existing test fixtures).
  • Smoke test against fedramp-prod once the paired backend PR is deployed: client.get_dataset(...).items_and_annotation_generator(...) yields items with item.phash populated.
  • CHANGELOG renders correctly on the release tag.

🤖 Generated with Claude Code

Greptile Summary

This PR exposes the backend-computed perceptual hash (phash) on DatasetItem by adding a new Optional[str] field and threading it through the single from_json deserialization entry point, making it available across all SDK methods that return items.

  • Adds PHASH_KEY = "phash" constant and phash: Optional[str] = None field to the DatasetItem dataclass; phash is intentionally omitted from to_payload since it is read-only and computed by the Nucleus backend.
  • Bumps the version to 0.18.3 and prepends a matching CHANGELOG entry following the project's Keep-a-Changelog convention.
  • Adds CLAUDE.md to document the repo's release workflow and architecture for future AI-assisted sessions.

Confidence Score: 5/5

This is a purely additive, backwards-compatible change — a new optional field defaulting to None with no effect on existing serialization or upload paths.

The change is minimal and surgical: one constant, one dataclass field, one payload.get call in from_json. The field defaults to None, so all existing callers and test fixtures continue to work unchanged. to_payload is correctly left untouched since phash is backend-computed and should not be round-tripped in uploads.
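
A minimal sketch of the read-only behaviour noted above, assuming a simplified upload payload. `UploadSketch` and its payload keys are illustrative only, and the real `to_payload` serializes more fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class UploadSketch:
    """Hypothetical stand-in showing a field that is read but never uploaded."""

    reference_id: str
    image_location: str
    phash: Optional[str] = None  # backend-computed, read-only on the client

    def to_payload(self) -> Dict[str, Any]:
        # Only client-owned fields are serialized; phash is deliberately
        # left out so an upload never sends a client-side value for it.
        return {
            "reference_id": self.reference_id,
            "image_url": self.image_location,
        }
```

Even if an item fetched from the backend carries a pHash, re-uploading it through a payload built this way would not send that value back.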

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nucleus/dataset_item.py | Adds phash: Optional[str] = None field to the dataclass and threads it through from_json; to_payload intentionally omits it since phash is backend-computed and read-only. |
| nucleus/constants.py | Adds PHASH_KEY = "phash" constant in alphabetical order, following the existing naming convention. |
| CHANGELOG.md | Prepends a v0.18.3 Keep-a-Changelog entry documenting the new phash field. |
| CLAUDE.md | New documentation file for AI agent sessions; no code impact. |
| pyproject.toml | Version bumped from 0.18.2 to 0.18.3 to match the additive, backwards-compatible field addition. |

Sequence Diagram

sequenceDiagram
    participant C as Caller (user code)
    participant SDK as SDK method<br/>(items_generator / query_items / iloc / etc.)
    participant FJ as DatasetItem.from_json
    participant API as Nucleus REST API

    C->>SDK: call SDK method
    SDK->>API: GET /v1/nucleus/...
    API-->>SDK: JSON payload (includes "phash" when backfilled)
    SDK->>FJ: from_json(payload)
    FJ->>FJ: payload.get(PHASH_KEY)  →  phash or None
    FJ-->>SDK: "DatasetItem(phash="0101...", ...)"
    SDK-->>C: DatasetItem with .phash populated

Reviews (2): Last reviewed commit: "Tighten phash field comment"

vinay553 and others added 2 commits May 18, 2026 18:50
Add a `phash` field to the DatasetItem dataclass and thread it through
`from_json`. Because every SDK method that returns a DatasetItem
(items_and_annotation_generator, items_generator, query_items,
dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json,
exposing the field there is sufficient — no per-method changes required.

Also adds a top-level CLAUDE.md with release/branch conventions and
architecture pointers for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>